Star Hotels Project

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts. Cancellation is often made easier by the option to cancel free of charge or at a low cost, which is beneficial to hotel guests but a less desirable, possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high on last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

- Loss of resources (revenue) when the hotel cannot resell the room.
- Additional distribution-channel costs: higher commissions or paid publicity to help resell these rooms.
- Last-minute price cuts so the hotel can resell a room, reducing the profit margin.
- Human resources to make arrangements for the guests.

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. Star Hotels Group, a chain of hotels in Portugal, is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable cancellation and refund policies.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary


Observations

Observations

Observations

Summary of Dataset

Observations

Observations

Univariate Analysis

Observations

Observations

Nothing more to add beyond what has already been said.

Bivariate Analysis

Observations

There are no notable correlations between the numeric variables.

Observations

We replace "canceled" and "not canceled" with 1 and 0, respectively, since we are focusing on cancellation motivations.

Observations

Here we try to detect whether the number of adults, children, weekend nights, or week nights could be a differentiating factor for the cancelers.

We try to determine a generic profile of cancelers with the mode function.

Cancelers prefer Meal Plan 1, Room Type 1, and the Online market segment, and are typically newcomers (no cancellation history, no previous canceled or honored bookings). They most often reserve in July, and mostly in 2018.
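The profile above can be sketched by filtering the canceled bookings and taking the mode of each column. This is a minimal illustration on synthetic rows; the column names (`booking_status`, `type_of_meal_plan`, etc.) are assumptions standing in for the actual data dictionary.

```python
import pandas as pd

# Hypothetical sketch: column names and values are assumptions, not
# taken from the real Star Hotels data dictionary.
bookings = pd.DataFrame({
    "booking_status": [1, 1, 1, 0],
    "type_of_meal_plan": ["Meal Plan 1", "Meal Plan 1", "Meal Plan 2", "Meal Plan 1"],
    "room_type_reserved": ["Room_Type 1", "Room_Type 1", "Room_Type 4", "Room_Type 1"],
    "market_segment_type": ["Online", "Online", "Offline", "Online"],
})

# Keep only canceled bookings and take the mode of each column: the most
# frequent value per attribute gives a "generic canceler profile".
cancelers = bookings[bookings["booking_status"] == 1]
profile = cancelers.mode().iloc[0]
print(profile["type_of_meal_plan"])   # Meal Plan 1
print(profile["market_segment_type"]) # Online
```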

Observations

Observations

Statistical treatment: are the means of lead time and average price significantly higher for canceled than for not canceled bookings?

Null and alternative hypothesis

Let 𝜇 be the mean lead time

𝜇1 : mean lead time for canceled

𝜇2 : mean lead time for not canceled

The null hypothesis is equality of means; in other words, there is no difference in lead time between canceled and not canceled: 𝜇1 = 𝜇2

The alternative hypothesis is that the lead time is higher for canceled than for not canceled: 𝜇1 > 𝜇2

The p-value is below the 0.05 cutoff, so we can reject the null hypothesis in favor of the alternative: the mean lead time for canceled bookings (143 days) is higher than for not canceled bookings (60 days).
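A one-sided two-sample t-test of this kind can be run with `scipy.stats.ttest_ind` using `alternative="greater"`. The samples below are synthetic, drawn around the reported means (143 and 60 days); the real test would use the lead-time columns of the canceled and not-canceled bookings.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical lead-time samples centered on the reported means;
# in the project these come from the actual dataset.
lead_canceled = rng.normal(143, 40, 500)      # canceled bookings
lead_not_canceled = rng.normal(60, 40, 500)   # honored bookings

# One-sided two-sample t-test: H0 mu1 = mu2, Ha mu1 > mu2.
# equal_var=False gives Welch's t-test (no equal-variance assumption).
t_stat, p_value = stats.ttest_ind(
    lead_canceled, lead_not_canceled,
    equal_var=False, alternative="greater",
)
print(p_value < 0.05)  # True -> reject H0
```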

Null and alternative hypothesis

Let 𝜇 be the mean average price

𝜇1 : mean average price for canceled

𝜇2 : mean average price for not canceled

The null hypothesis is equality of means; in other words, there is no difference in average price between canceled and not canceled: 𝜇1 = 𝜇2

The alternative hypothesis is that the average price per room is higher for canceled than for not canceled: 𝜇1 > 𝜇2

The p-value is below the 0.05 cutoff, so the null hypothesis is rejected in favor of the alternative: there is a difference in average price per room between canceled (117) and not canceled (104) bookings.

Observations

Conclusions

Observations

In conclusion, cancellations seem to concern holiday stays more, and therefore families.

Outlier Treatment


Model Preparation

We replace "canceled" and "not canceled" with 1 and 0, respectively, since we are focusing on cancellation motivations.
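This encoding is a one-line mapping in pandas. The column name `booking_status` and the string labels are assumptions standing in for the actual dataset values.

```python
import pandas as pd

# Hypothetical sketch: the target column is assumed to be named
# "booking_status" with values "Canceled" / "Not_Canceled".
df = pd.DataFrame({"booking_status": ["Canceled", "Not_Canceled", "Canceled"]})

# Encode the target: 1 = canceled, 0 = not canceled
df["booking_status"] = df["booking_status"].map(
    {"Canceled": 1, "Not_Canceled": 0}
)
print(df["booking_status"].tolist())  # [1, 0, 1]
```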

Building the Logistic Regression Model

Sklearn Model Results
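The sklearn fit can be sketched as below. The features are synthetic placeholders; in the project, `X` holds the encoded booking attributes and `y` the 0/1 cancellation flag.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two informative features plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Stratified split keeps the cancellation rate similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

lg = LogisticRegression(max_iter=1000)
lg.fit(X_train, y_train)
print(round(lg.score(X_test, y_test), 2))  # test-set accuracy
```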

Checking performance on train set

Checking performance on test set

Observations

Statsmodel results

Observations

Additional Information on VIF
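VIF (variance inflation factor) quantifies multicollinearity: how well each feature is predicted by the others. A sketch with statsmodels, using synthetic columns where `x2` is deliberately near-collinear with `x1`:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # almost a copy of x1
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF per column: 1/(1 - R^2) of regressing that column on the others.
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # x1 and x2 have very large VIFs; x3 stays near 1
```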

Observations

Removing market segment type Online

Observations

Observations

Note: The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.

Now that every variable has a p-value < 0.05, we'll consider the features in X_train2 as the final ones and lg3 as the final model.

Coefficient interpretation

At this point, the more adults, weekend nights, lead time, and average price, the more the cancellation probability increases. On the negative side, the number of children, a required car parking space, and the arrival time (year or month) decrease it, which points more to a corporate profile. This is confirmed by the influence of the market segment: Corporate or Offline tends to decrease the probability of cancellation. This confirms our hypothesis: families have a higher cancellation probability than the others.

Converting Coefficient to odds
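Converting a logistic-regression coefficient to an odds ratio is just exponentiation: `exp(coef)` is the factor by which the odds of cancellation are multiplied for a one-unit increase in that feature. The coefficients below are made-up values for illustration.

```python
import numpy as np

# Illustrative coefficients (not the fitted lg3 values).
# exp(0.7) ~ 2: each unit increase roughly doubles the cancellation odds;
# exp(-0.3) < 1: the feature decreases the odds.
coefs = np.array([0.7, -0.3, 0.05])
odds = np.exp(coefs)
print(np.round(odds, 2))  # [2.01 0.74 1.05]
```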

Coefficient interpretation

The logistic model is giving good performance on the training set

The optimal Threshold seems to be at 0.42

Creating a confusion matrix adapted to the optimal threshold

The 0.32 Threshold looks more attractive in terms of performance
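Applying a custom threshold means classifying a booking as canceled whenever its predicted probability exceeds that value, then building the confusion matrix from those labels. The probabilities below are hypothetical scores, not model output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predicted cancellation probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_proba = np.array([0.10, 0.35, 0.40, 0.80, 0.30, 0.20])

# Classify as canceled (1) when the probability exceeds the threshold.
threshold = 0.32
y_pred = (y_proba > threshold).astype(int)

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows: true class, columns: predicted class
```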

Checking Performance on the test set

Dropping the columns that were dropped on the training set

Using model with default threshold

Using model threshold 0.42

Using model threshold 0.32

Model Performance Summary

Conclusions

The best model appears to be the one with the 0.32 threshold, which has the highest recall

Recommendations


Decision Tree Approach

Checking Performance on training set

Checking Performance on test set

As anticipated, there is a huge difference in performance between the train and test sets, suggesting that the model is overfitting.

Visualizing the model

According to the model, lead time is the most important parameter for predicting cancellation
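That importance ranking can be read directly from `feature_importances_` on a fitted tree. A synthetic sketch, where only the first feature (a stand-in for lead time) actually drives the label:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0.2).astype(int)   # only the first feature matters

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# Importances sum to 1; the dominant feature gets nearly all of it here.
print(tree.feature_importances_)
```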

Reducing Overfitting

Using GridSearch for Hyperparameter tuning of our tree model
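A GridSearchCV over tree-size constraints can be sketched as follows; the data is synthetic and the parameter grid is an assumption, but `scoring="recall"` matches the metric this project optimizes.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Grid of constraints that limit tree size and thus overfitting.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [5, 10, 20],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid, scoring="recall", cv=5,
)
grid.fit(X, y)
print(grid.best_params_)  # best combination found by cross-validation
```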

Checking Performance on training set

Checking performance on test set

Visualization of Decision Tree

Lead time is the parameter that most influences cancellation, with average price per room second, according to the Decision Tree model

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
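The steps above can be sketched directly with sklearn's `cost_complexity_pruning_path`, on synthetic data standing in for the bookings:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Effective alphas and total leaf impurities along the pruning path;
# alphas are nondecreasing as more of the tree is pruned.
clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# One tree per alpha; the last alpha prunes down to a single node.
clfs = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
        for a in ccp_alphas]
print(clfs[-1].tree_.node_count)  # 1
```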

The maximum score corresponds to a ccp_alpha value of 0.0125. However, a ccp_alpha closer to 0 may make sense, as recall is higher there.

Checking performance on training set

Checking performance on test set

Recall is the same on the train and test sets, which is a good sign

However, the model overfits with a ccp_alpha of 0.00125. This leads us to try a slightly larger ccp_alpha.

Checking with ccp_alpha = 0.002

Checking performance on training set

Checking performance on test set

The model looks clean, with no overfitting

Observations

The model comparison shows an improvement from the first decision tree model to the final one, which makes us comfortable with the results obtained. Besides, these results confirm all the previous work; their redundancy strengthens the conclusions.

Conclusions

As seen through the EDA, Logistic Regression, and Decision Tree approaches, the canceler profile can be defined by these key elements:

Some elements are highlighted in some approaches and not in others:

All in all, several elements suggest that the cancelers have more of a "family" profile than non-cancelers. Non-cancelers, on the opposite, look more like corporate customers.

Business Recommendations

These measures may help avoid cancellations, facilitate room repositioning when a cancellation occurs, and reduce cancellation probability.